feat: RSPEED-2538 add optional verbose metadata to /v1/infer endpoint #1305
tisnik merged 5 commits into lightspeed-core:main from
Conversation
Add development/testing feature to return extended debugging metadata from /v1/infer endpoint, similar to /v1/query. Requires dual opt-in: config flag (allow_verbose_infer) and request parameter (include_metadata). When enabled, returns tool_calls, tool_results, rag_chunks, referenced_documents, and token counts. Maintains backwards compatibility by excluding null fields from standard responses. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the CodeRabbit settings.
Walkthrough

Adds an opt-in verbose metadata mode to the /infer endpoint: a new config flag and request field gate returning full LLM response metadata (tool calls/results, RAG chunks, referenced documents, token usage, turn summary) alongside the standard text/request_id. The route now uses response_model_exclude_none=True.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client as Client
    participant API as API (/infer)
    participant Llama as LlamaStack
    participant Builder as Response Builder
    Client->>API: POST /infer (include_metadata=true?)
    API-->>API: evaluate config.allow_verbose_infer && request.include_metadata
    alt verbose enabled
        API->>Llama: request full inference (full response)
        Llama-->>API: full response (output, usage, tool_calls, rag_chunks, referenced_documents)
        API->>Builder: build_turn_summary(response.output)
        Builder-->>API: turn_summary
        API-->>Client: RlsapiV1InferResponse { text, request_id, tool_calls, tool_results, rag_chunks, referenced_documents, input_tokens, output_tokens, turn_summary }
    else verbose disabled
        API->>Llama: request simple inference (text)
        Llama-->>API: simple text
        API-->>Client: RlsapiV1InferResponse { text, request_id }
    end
```
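The dual opt-in gate evaluated in the diagram can be sketched as a small predicate. This is a minimal illustration: the dataclasses below are stand-ins for the real config and request models, and only the two flag names (`allow_verbose_infer`, `include_metadata`) come from the PR itself.

```python
# Sketch of the dual opt-in gate for verbose /v1/infer responses.
# StubCustomization and StubInferRequest are hypothetical stand-ins,
# not the real pydantic models from src/models/.
from dataclasses import dataclass


@dataclass
class StubCustomization:
    allow_verbose_infer: bool = False  # server-side config flag


@dataclass
class StubInferRequest:
    question: str
    include_metadata: bool = False  # per-request opt-in


def verbose_enabled(config: StubCustomization, request: StubInferRequest) -> bool:
    """Both the config flag and the request parameter must opt in."""
    return config.allow_verbose_infer and request.include_metadata


if __name__ == "__main__":
    cfg_on, cfg_off = StubCustomization(True), StubCustomization(False)
    req_on = StubInferRequest("How do I list files?", include_metadata=True)
    req_off = StubInferRequest("How do I list files?")
    # Only one of the four combinations enables the verbose path.
    print(verbose_enabled(cfg_on, req_on))    # True
    print(verbose_enabled(cfg_on, req_off))   # False
    print(verbose_enabled(cfg_off, req_on))   # False
    print(verbose_enabled(cfg_off, req_off))  # False
```

Either flag alone keeps the standard text-only response, which is what the negative-path tests discussed below lock in.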
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 3 passed
🧹 Nitpick comments (1)
src/app/endpoints/rlsapi_v1.py (1)
464-483: Consider extracting shared response creation logic.

The verbose path duplicates the `client.responses.create()` call that exists in `retrieve_simple_response()`. Consider refactoring to avoid duplication.

♻️ Proposed refactor to reduce duplication
```diff
 async def retrieve_simple_response(
     question: str,
     instructions: str,
     tools: Optional[list[Any]] = None,
     model_id: Optional[str] = None,
-) -> str:
+    return_full_response: bool = False,
+) -> str | OpenAIResponseObject:
     """Retrieve a simple response from the LLM for a stateless query.
-
-    Uses the Responses API for simple stateless inference, consistent with
-    other endpoints (query, streaming_query).
-
-    Args:
-        question: The combined user input (question + context).
-        instructions: System instructions for the LLM.
-        tools: Optional list of MCP tool definitions for the LLM.
-        model_id: Fully qualified model identifier in provider/model format.
-            When omitted, the configured default model is used.
-
-    Returns:
-        The LLM-generated response text.
-
-    Raises:
-        APIConnectionError: If the Llama Stack service is unreachable.
-        HTTPException: 503 if no default model is configured.
+    ...
+
+    Args:
+        ...
+        return_full_response: If True, return the full OpenAIResponseObject.
+
+    Returns:
+        The LLM-generated response text, or full response object if requested.
     """
     client = AsyncLlamaStackClientHolder().get_client()
     resolved_model_id = model_id or await _get_default_model_id()
     logger.debug("Using model %s for rlsapi v1 inference", resolved_model_id)
     response = await client.responses.create(
         input=question,
         model=resolved_model_id,
         instructions=instructions,
         tools=tools or [],
         stream=False,
         store=False,
     )
     response = cast(OpenAIResponseObject, response)
-    extract_token_usage(response.usage, resolved_model_id)
-    return extract_text_from_response_items(response.output)
+    if return_full_response:
+        return response
+
+    extract_token_usage(response.usage, resolved_model_id)
+    return extract_text_from_response_items(response.output)
```

Then in `infer_endpoint`:

```diff
-    if verbose_enabled:
-        client = AsyncLlamaStackClientHolder().get_client()
-        response = await client.responses.create(
-            input=input_source,
-            model=model_id,
-            instructions=instructions,
-            tools=mcp_tools or [],
-            stream=False,
-            store=False,
-        )
-        response = cast(OpenAIResponseObject, response)
-        response_text = extract_text_from_response_items(response.output)
-    else:
-        response = None
-        response_text = await retrieve_simple_response(...)
+    if verbose_enabled:
+        response = await retrieve_simple_response(
+            input_source, instructions, tools=mcp_tools,
+            model_id=model_id, return_full_response=True
+        )
+        response = cast(OpenAIResponseObject, response)
+        response_text = extract_text_from_response_items(response.output)
+    else:
+        response = None
+        response_text = await retrieve_simple_response(
+            input_source, instructions, tools=mcp_tools, model_id=model_id
+        )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/app/endpoints/rlsapi_v1.py` around lines 464 - 483, The verbose and non-verbose branches duplicate the client.responses.create call; refactor by extracting the shared response-creation logic into a single helper (for example a new function used by infer_endpoint and retrieve_simple_response) that calls AsyncLlamaStackClientHolder().get_client().responses.create with parameters (input_source, model_id, instructions, tools, stream, store) and returns the response object/text; then have infer_endpoint call that helper (or have retrieve_simple_response delegate to it) and remove the duplicated client.responses.create usage in the verbose branch so both paths share the same implementation.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 534d060a-0ac8-4214-8bda-9ec1b42ff84d
📒 Files selected for processing (4)
- src/app/endpoints/rlsapi_v1.py
- src/models/config.py
- src/models/rlsapi/requests.py
- src/models/rlsapi/responses.py
@Lifto This looks okay for something temporary, but it sure would be nice if you added some tests. 😉
…onse Made-with: Cursor
tests added, PR uses template, CI passes
🧹 Nitpick comments (1)
tests/unit/app/endpoints/test_rlsapi_v1.py (1)
650-702: Add the missing dual-opt-in negative-path test.

This covers the happy path well, but it should also verify that metadata is not returned when `include_metadata=True` and `allow_verbose_infer=False` (to lock in the dual-gate contract).

Suggested test addition
```diff
+async def test_infer_include_metadata_ignored_when_verbose_infer_disabled(
+    mocker: MockerFixture,
+    mock_configuration: AppConfig,
+    mock_auth_resolvers: None,
+) -> None:
+    """Metadata should remain excluded unless both request and config opt in."""
+    custom_mock = mocker.Mock()
+    custom_mock.allow_verbose_infer = False
+    custom_mock.system_prompt = "You are a helpful assistant."
+    config_mock = mocker.Mock()
+    config_mock.inference = mock_configuration.inference
+    config_mock.customization = custom_mock
+    mocker.patch("app.endpoints.rlsapi_v1.configuration", config_mock)
+
+    mock_response = mocker.Mock()
+    mock_response.output = [
+        _create_mock_response_output(mocker, "Response with metadata disabled.")
+    ]
+    mock_usage = mocker.Mock()
+    mock_usage.input_tokens = 99
+    mock_usage.output_tokens = 11
+    mock_response.usage = mock_usage
+    _setup_responses_mock(mocker, mocker.AsyncMock(return_value=mock_response))
+
+    infer_request = RlsapiV1InferRequest(
+        question="How do I list files?", include_metadata=True
+    )
+    mock_request = _create_mock_request(mocker)
+    mock_background_tasks = _create_mock_background_tasks(mocker)
+
+    response = await infer_endpoint(
+        infer_request=infer_request,
+        request=mock_request,
+        background_tasks=mock_background_tasks,
+        auth=MOCK_AUTH,
+    )
+
+    assert response.data.tool_calls is None
+    assert response.data.tool_results is None
+    assert response.data.rag_chunks is None
+    assert response.data.referenced_documents is None
+    assert response.data.input_tokens is None
+    assert response.data.output_tokens is None
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/app/endpoints/test_rlsapi_v1.py` around lines 650 - 702, Add a new unit test mirroring test_infer_include_metadata_returns_verbose_response but with the module-level configuration's customization.allow_verbose_infer set to False; construct an RlsapiV1InferRequest(include_metadata=True), call infer_endpoint (same mocks as the existing test), and assert the returned RlsapiV1InferResponse does NOT include metadata fields (tool_calls, tool_results, rag_chunks, referenced_documents) and that token fields are absent or zero as appropriate, ensuring the dual-opt-in behavior enforced by configuration.customization.allow_verbose_infer is validated.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: be568bdc-eb2b-4252-84ae-52ec181b944d
📒 Files selected for processing (1)
tests/unit/app/endpoints/test_rlsapi_v1.py
Made-with: Cursor
…payload Made-with: Cursor
Actionable comments posted: 1
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 9333c776-7f08-481b-9073-1915a75af231
📒 Files selected for processing (5)
- src/app/endpoints/rlsapi_v1.py
- src/models/config.py
- src/models/rlsapi/requests.py
- src/models/rlsapi/responses.py
- tests/unit/app/endpoints/test_rlsapi_v1.py
🚧 Files skipped from review as they are similar to previous changes (2)
- src/models/config.py
- src/models/rlsapi/requests.py
```python
# For verbose mode, retrieve the full response object instead of just text
if verbose_enabled:
    client = AsyncLlamaStackClientHolder().get_client()
    response = await client.responses.create(
        input=input_source,
        model=model_id,
        instructions=instructions,
        tools=mcp_tools or [],
        stream=False,
        store=False,
    )
    response = cast(OpenAIResponseObject, response)
    response_text = extract_text_from_response_items(response.output)
```
Verbose path may skip token usage metrics recording.
The non-verbose path calls retrieve_simple_response(), which invokes extract_token_usage() to record token usage metrics (via _increment_llm_call_metric()). The verbose path directly calls client.responses.create() but does not call extract_token_usage(), so token usage metrics may not be recorded for verbose requests.
Consider calling extract_token_usage(response.usage, model_id) in the verbose path to ensure consistent metrics tracking:
Proposed fix

```diff
     response = cast(OpenAIResponseObject, response)
     response_text = extract_text_from_response_items(response.output)
+    extract_token_usage(response.usage, model_id)
 else:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/app/endpoints/rlsapi_v1.py` around lines 462 - 475, Verbose branch calls
AsyncLlamaStackClientHolder().get_client().responses.create(...) and uses
extract_text_from_response_items(response.output) but does not record token
metrics; update the verbose path to call extract_token_usage(response.usage,
model_id) after casting the response (same place as
extract_text_from_response_items) so metrics recorded by
_increment_llm_call_metric are executed (mirror what retrieve_simple_response
does) ensuring consistent token usage tracking for verbose requests.
Co-authored-by: Claude 4.5 Sonnet
🧹 Nitpick comments (2)
tests/unit/app/endpoints/test_rlsapi_v1.py (2)
656-708: Add a serialized-shape assertion for the verbose path.

You already validate non-`None` fields; consider also asserting the serialized `data` keys to pin the public payload contract end-to-end.

Suggested test hardening
```diff
     assert response.data.input_tokens == 42
     assert response.data.output_tokens == 18
+    data_keys = set(response.model_dump(exclude_none=True)["data"].keys())
+    assert data_keys == {
+        "text",
+        "request_id",
+        "tool_calls",
+        "tool_results",
+        "rag_chunks",
+        "referenced_documents",
+        "input_tokens",
+        "output_tokens",
+    }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/app/endpoints/test_rlsapi_v1.py` around lines 656 - 708, Add a serialized-shape assertion in test_infer_include_metadata_returns_verbose_response to lock the public payload keys by converting the returned response data to its dict/serialized form (e.g., via response.data.dict() or response.dict()["data"]) and asserting the expected set of keys is present; specifically ensure keys like "text", "request_id", "tool_calls", "tool_results", "rag_chunks", "referenced_documents", "input_tokens", and "output_tokens" are included alongside any other mandatory keys your API returns so the verbose path contract is asserted end-to-end.
710-753: Also assert key omission when `include_metadata=True` but config disables verbose.

This branch is the security/compatibility edge of dual opt-in; adding a serialized-key check ensures metadata is truly excluded in the output, not just `None` in the model instance.

Suggested assertion to lock response shape
```diff
     assert response.data.input_tokens is None
     assert response.data.output_tokens is None
+    data_keys = set(response.model_dump(exclude_none=True)["data"].keys())
+    assert data_keys == {
+        "text",
+        "request_id",
+    }, f"Expected only text and request_id, got {data_keys}"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/app/endpoints/test_rlsapi_v1.py` around lines 710 - 753, The test test_infer_include_metadata_ignored_when_verbose_infer_disabled should also assert that the serialized response omits metadata keys, not just that model attributes are None; after calling infer_endpoint, serialize the response (e.g., response.dict() or response.json() into a dict) and assert that keys ["tool_calls","tool_results","rag_chunks","referenced_documents","input_tokens","output_tokens"] are not present under the "data" object; update this test to perform that serialized-key absence check so infer_endpoint / RlsapiV1InferRequest behavior is enforced end-to-end.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 274d33e0-9b77-475a-a383-71c22b97fe37
📒 Files selected for processing (1)
tests/unit/app/endpoints/test_rlsapi_v1.py
Description
Add a development/testing feature to return extended debugging metadata from the `/v1/infer` endpoint, similar to `/v1/query`. Requires dual opt-in: a config flag (`allow_verbose_infer`) and a request parameter (`include_metadata`).

When enabled, returns:
- `tool_calls` — MCP tool calls made during inference
- `tool_results` — Results from MCP tool calls
- `rag_chunks` — RAG chunks retrieved from documentation
- `referenced_documents` — Source documents referenced
- `input_tokens` / `output_tokens` — Token usage

Maintains backwards compatibility by excluding null fields from standard responses (via `response_model_exclude_none=True`).

Type of change
Tools used to create PR
Identify any AI code assistants used in this PR (for transparency and review context)
Related Tickets & Documents
Checklist before requesting a review
Testing
- `/v1/infer` without `include_metadata` (or with config `allow_verbose_infer: false`): response contains only `text` and `request_id`; no metadata fields.
- `include_metadata: true` but config `allow_verbose_infer: false` returns the standard response only.
- With `allow_verbose_infer: true` in config and `include_metadata: true` in the request, the response includes `tool_calls`, `tool_results`, `rag_chunks`, `referenced_documents`, `input_tokens`, `output_tokens`.
- `test_infer_minimal_request` asserts the standard response has only `text`/`request_id` and no metadata; `test_infer_include_metadata_returns_verbose_response` asserts the verbose path returns all metadata fields and token counts. Run: `uv run pytest tests/unit/app/endpoints/test_rlsapi_v1.py -v -k "infer_minimal or infer_include_metadata"`.
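The standard-response scenarios above rely on null fields being dropped at serialization time (the route's `response_model_exclude_none=True`). The sketch below mimics that exclude-none behavior with a plain dict; `serialize_infer_data` and its reduced field set are illustrative stand-ins, not the real response model.

```python
# Mimics the effect of response_model_exclude_none=True on the /v1/infer
# payload: None-valued fields are dropped, so standard responses expose only
# text and request_id. A plain dict stands in for the real pydantic model.
from typing import Any, Optional


def serialize_infer_data(
    text: str,
    request_id: str,
    tool_calls: Optional[list[Any]] = None,
    input_tokens: Optional[int] = None,
    output_tokens: Optional[int] = None,
) -> dict[str, Any]:
    payload = {
        "text": text,
        "request_id": request_id,
        "tool_calls": tool_calls,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
    }
    # Drop None fields, as exclude_none=True does during serialization.
    return {k: v for k, v in payload.items() if v is not None}


standard = serialize_infer_data("hello", "req-1")
verbose = serialize_infer_data(
    "hello", "req-1", tool_calls=[], input_tokens=42, output_tokens=18
)
print(sorted(standard))  # ['request_id', 'text']
print(sorted(verbose))
```

Because omission happens at serialization rather than in the model, older clients of the standard path never see the new keys, which is the backwards-compatibility guarantee the tests pin down.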
New Features
Tests